SE Minneapolis, MN 55455-0159 USA TR 04-025 SUMMARY: Efficiently Summarizing Transactions for Clustering
Abstract
Frequent itemset mining was initially proposed, and has been studied extensively, in the context of association rule mining. In recent years, several studies have also extended its application to transaction (or document) classification and clustering. However, most frequent-itemset-based clustering algorithms must first mine a large intermediate set of frequent itemsets in order to identify a subset of the most promising ones that can be used for clustering. In this paper, we study how to directly find a subset of high-quality frequent itemsets that can serve as a concise summary of the transaction database and be used to cluster categorical data. By exploring some properties of the subset of itemsets that we are interested in, we propose several search space pruning methods and design an efficient algorithm called SUMMARY. Our empirical results show that SUMMARY runs very fast even when the minimum support is extremely low and scales very well with respect to the database size. Surprisingly, as a pure frequent itemset mining algorithm, it is also very effective at clustering categorical data and summarizing dense transaction databases.
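To make the idea of summarizing transactions with frequent itemsets concrete, the following is a minimal brute-force sketch. It is not the SUMMARY algorithm and uses none of its pruning methods. It assumes, purely for illustration, that each transaction is summarized by one longest frequent itemset it contains, and that transactions sharing the same summary itemset form a cluster. The function names `frequent_itemsets` and `summarize` and the tie-breaking rule are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute force: count every non-empty subset of every transaction.

    Exponential in transaction length, so only suitable for tiny examples.
    Returns a dict mapping each frequent itemset (a sorted tuple) to its support.
    """
    counts = defaultdict(int)
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def summarize(transactions, min_support):
    """Group transaction ids by one longest frequent itemset each contains.

    Assumption (illustrative only): ties are broken by higher support, then
    lexicographic order. Transactions containing no frequent itemset are
    grouped under the empty tuple.
    """
    frequent = frequent_itemsets(transactions, min_support)
    clusters = defaultdict(list)
    for tid, t in enumerate(transactions):
        items = set(t)
        candidates = [s for s in frequent if items.issuperset(s)]
        if not candidates:
            clusters[()].append(tid)
            continue
        best = min(candidates, key=lambda s: (-len(s), -frequent[s], s))
        clusters[best].append(tid)
    return dict(clusters)


if __name__ == "__main__":
    tx = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["d", "e"], ["d", "e"]]
    for itemset, tids in summarize(tx, min_support=2).items():
        print(itemset, tids)
```

With a minimum support of 2, the five toy transactions fall into three groups keyed by the itemsets (a, b, c), (a, b), and (d, e). The sketch enumerates every frequent itemset first, which is exactly the large intermediate set that the abstract says a direct approach tries to avoid.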
Related resources
Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159 USA TR 04-002 Enhancing location service scalability with HIGH-GRADE
TR 08-042 Infobionics Server - the next generation database
This paper describes the ‘Infobionics Server’, a next-generation database, also referred to as the ‘Cellular Database Server’, that is based on a novel ‘cellular’ data model.
TR 04-021 gCLUTO – An Interactive Clustering, Visualization, and Analysis System
Recently published studies have shown that partitional clustering algorithms that optimize certain criterion functions, which measure key aspects of inter- and intra-cluster similarity, are very effective in producing hard clustering solutions for document datasets and outperform traditional partitional and agglomerative algorithms. In this paper we study the extent to which these criterion funct...
Smaller is tougher
Smaller is tougher A.R. Beaber a , J.D. Nowak b , O. Ugurlu c , W.M. Mook d , S.L. Girshick e , R. Ballarini f & W.W. Gerberich a a Department of Chemical Engineering and Materials Science, University of Minnesota, 421 Washington Ave SE, Minneapolis, MN 55455, USA b Hysitron Incorporated, 10025 Valley View Road, Minneapolis, Minnesota 55344, USA c Characterization Facility, University of Minnes...
Small size strength dependence on dislocation nucleation
J.D. Nowak, A.R. Beaber, O. Ugurlu, S.L. Girshick and W.W. Gerberich* Hysitron Incorporated, 10025 Valley View Road, Minneapolis, MN 55344, USA Department of Chemical Engineering and Materials Science, University of Minnesota, 421 Washington Ave SE, Minneapolis, MN 55455, USA Characterization Facility, University of Minnesota, Minneapolis, MN 55455, USA Department of Mechanical Engineering, Uni...